The Wang-Landau algorithm aims at sampling from a probability
distribution, while penalizing some regions of the state space and
favouring others. It is widely used, but its convergence properties are
still unknown. We show that for some variations of the algorithm,
the Wang-Landau algorithm reaches the so-called Flat Histogram
criterion in finite time, and that this criterion may never be reached
for other variations. The arguments are shown in a simple context
(compact spaces, density functions bounded from both sides) for the
sake of clarity, and could be extended to more general contexts.






1. Introduction and notations. Consider the problem of sampling
from a probability distribution $\pi$ defined on a measure space
$(\mathcal{X}, \Sigma, \mu)$. We suppose that we can compute the
probability density function of $\pi$ at any point $x \in \mathcal{X}$,
up to a multiplicative constant. Given a proposal kernel $Q(\cdot,\cdot)$
we define a Metropolis-Hastings (MH) (Hastings, 1970; Tierney, 1998)
transition kernel targeting $\pi$, denoted by $K(\cdot,\cdot)$, as follows:

$$\forall x, y \in \mathcal{X} \quad K(x,y) = Q(x,y)\rho(x,y) + \delta_x(y)(1 - r(x))$$

with $\rho(x,y)$ defined by

$$\rho(x,y) := 1 \wedge \frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)}$$

and $r(x)$ defined by:

$$r(x) := \int_{\mathcal{X}} \rho(x,y) Q(x,y)\,dy$$

Here the delta function $\delta_a(b)$ takes value 1 when $a = b$ and 0 otherwise.
Under some conditions on the proposal $Q$ and the target $\pi$, the MH kernel
defines an algorithm to generate a Markov chain with stationary distribution
$\pi$ (Robert and Casella, 2004).

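Let us consider a partition of the state space $\mathcal{X}$ into $d$ disjoint sets
$\mathcal{X}_1, \ldots, \mathcal{X}_d$:

$$\mathcal{X} = \bigcup_{i=1}^{d} \mathcal{X}_i$$

If we have a sample $X_1, \ldots, X_t$ independent and identically distributed from
$\pi$, then for any $i \in [1, d]$:

$$\frac{1}{t}\sum_{n=1}^{t} \mathbb{1}_{\mathcal{X}_i}(X_n) \xrightarrow[t\to\infty]{} \int_{\mathcal{X}_i} \pi(x)\,dx$$

where we denote by $\mathbb{1}_{\mathcal{X}_i}(x)$ the indicator function that is equal to 1 when
$x \in \mathcal{X}_i$ and 0 otherwise. Similar convergence is obtained when $X_1, \ldots, X_t$
is an ergodic chain such as the one generated by the MH algorithm. The
purpose of the Wang-Landau algorithm (Wang and Landau, 2001a, b; Liang,
2005; Atchade and Liu, 2010) is to obtain a sample

• such that for any $i \in [1,d]$ the subsample

$$\{X_n \text{ for } n \in [1,t] \text{ s.t. } X_n \in \mathcal{X}_i\}$$

is distributed according to the restriction of $\pi$ to $\mathcal{X}_i$, and

• such that for any $i \in [1,d]$

$$\frac{1}{t}\sum_{n=1}^{t} \mathbb{1}_{\mathcal{X}_i}(X_n) \xrightarrow[t\to\infty]{} \phi_i$$

where $\phi = (\phi_1, \ldots, \phi_d)$ is chosen by the user, and could be any vector
in $]0,1[^d$ such that $\sum_{i=1}^{d} \phi_i = 1$.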

A typical use of this algorithm is to sample from multimodal distributions,
by penalizing already-visited regions and favouring the exploration of regions
between modes, in an attempt to recover all the modes.

This algorithm, in the class of Markov Chain Monte Carlo (MCMC) algorithms
(Robert and Casella, 2004), therefore allows one to learn about $\pi$ while
"forcing" the proportions of visits $\phi_i$ of the generated chain to any of the
sets $\mathcal{X}_i$, which are typically also chosen by the user. The vector
$(\phi_1, \ldots, \phi_d)$ might be referred to as the "desired frequencies", and the
sets $\mathcal{X}_i$ are called the "bins". In a typical situation, the mass of $\pi$
over bin $\mathcal{X}_i$, which we denote by $\psi_i$, is unknown, and hence one cannot
easily guess how much to "penalize" or to "favour" a bin $\mathcal{X}_i$ in order to
obtain the desired frequency $\phi_i$. The Wang-Landau algorithm introduces a vector
$\theta_t = (\theta_t(1), \ldots, \theta_t(d))$, referred to as "penalties" at time $t$,
which is updated at every iteration $t$, and which acts like an approximation of the
ratios $\psi_1/\phi_1, \ldots, \psi_d/\phi_d$, up to a multiplicative constant.

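For a distribution $\pi$ and a vector of penalties $\theta = (\theta(1), \ldots, \theta(d))$, we
define the penalized distribution $\pi_\theta$:

$$\pi_\theta(x) \propto \pi(x) \times \sum_{i=1}^{d} \frac{\mathbb{1}_{\mathcal{X}_i}(x)}{\theta(i)}$$

To be more concise we define a function $J : \mathcal{X} \mapsto \{1, \ldots, d\}$ that takes a
state $x \in \mathcal{X}$ and returns the index $i$ of the bin such that $x \in \mathcal{X}_i$. We
can now write: $\pi_\theta(x) \propto \pi(x)/\theta(J(x))$. We will denote by $K_\theta$ the MH kernel
targeting $\pi_\theta$.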
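As an illustration of the kernel $K_\theta$, the following is a minimal sketch of one MH move targeting the penalized distribution, under assumptions that are not part of the text: a one-dimensional state space cut into intervals by a hypothetical list `edges`, a symmetric Gaussian random walk proposal (so that the ratio of proposal densities cancels), and an unnormalised log-density `log_pi`. The function names are illustrative only.

```python
import math
import random


def bin_index(x, edges):
    """J(x), 0-based: index of the bin containing x, bins being ]edges[i], edges[i+1]]
    (the first bin also contains its left end point)."""
    for i in range(len(edges) - 2):
        if x <= edges[i + 1]:
            return i
    return len(edges) - 2


def mh_step_penalized(x, log_pi, log_theta, edges, step=1.0, rng=random):
    """One MH move targeting pi_theta(x), proportional to pi(x) / theta(J(x)).

    Assumes a symmetric random walk proposal, so Q(y, x) / Q(x, y) = 1.
    """
    y = x + rng.gauss(0.0, step)
    if y < edges[0] or y > edges[-1]:
        return x  # the target has no mass outside the state space: reject
    log_ratio = (log_pi(y) - log_theta[bin_index(y, edges)]) \
        - (log_pi(x) - log_theta[bin_index(x, edges)])
    # 1.0 - rng.random() lies in ]0, 1], which keeps the logarithm finite
    if math.log(1.0 - rng.random()) < log_ratio:
        return y
    return x
```

With all log-penalties equal, the move reduces to a plain MH step targeting $\pi$; raising `log_theta[i]` makes moves into bin `i` less likely to be accepted.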

The Wang-Landau algorithm, described in the next section, alternates
between generating a sample by targeting $\pi_\theta$ using $K_\theta$, and updating $\theta$ using
the generated sample. In this sense it is an adaptive MCMC algorithm (past
samples are used to update the kernel at a given iteration), using an auxiliary
chain $(\theta_t)$, and therefore the behaviour of the sample is not obvious.

The Wang-Landau algorithm is widely used in the Physics community 
(Silva, Caparica and Plascak, 2006; Malakis, Kalozoumis and Tyraskis, 2006; 
Cunha Netto et al., 2006). In particular, many practitioners use flavours of
the algorithm with a "Flat-Histogram" criterion. However, its convergence 
properties are still partially unknown. We show that this criterion is reached 
in finite time for some variations of the algorithm. This result is all that was 
missing to apply results on adaptive algorithms with diminishing adaptation 
(Fort et al., 2011). 

In Section 2, we define variations of the Wang-Landau algorithm. We then 
introduce ratios of penalties and argue for their convenience in studying the 
properties of the algorithm. We prove in Sections 3 and 4 that under certain 
conditions, the Flat Histogram criterion is met in finite time, for the cases 
d = 2 and d > 2 respectively. The result is illustrated in Section 5, and in 
Section 6, we hint at how our assumptions might be relaxed. 

2. Wang-Landau algorithms: different flavours. There are several 
versions of the Wang-Landau algorithm. We describe the general version 
introduced by Atchade and Liu (2010), both in its deterministic form and 
with a stochastic schedule. 

2.1. A first version with deterministic schedule. Let $(\gamma_t)_{t \in \mathbb{N}}$ (referred to
as a schedule or a temperature) be a sequence of positive real numbers such
that:

$$\sum_{t \geq 0} \gamma_t = \infty$$

$$\sum_{t \geq 0} \gamma_t^2 < \infty$$

A typical choice is $\gamma_t := t^{-\alpha}$ with $\alpha \in ]0.5, 1[$. The Wang-Landau algorithm
is described in pseudo-code in Algorithm 1. In this form, the schedule $\gamma_t$
decreases at each iteration, and is therefore called "deterministic".

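Algorithm 1 Wang-Landau with deterministic schedule

1: Init $\forall i \in \{1, \ldots, d\}$ set $\theta_0(i) \leftarrow 1/d$.

2: Init $X_0 \in \mathcal{X}$.

3: for $t = 1$ to $T$ do

4: Sample $X_t$ from $K_{\theta_{t-1}}(X_{t-1}, \cdot)$, MH kernel targeting $\pi_{\theta_{t-1}}$.

5: Update the penalties: $\log\theta_t(i) \leftarrow \log\theta_{t-1}(i) + f(\mathbb{1}_{\mathcal{X}_i}(X_t), \phi_i, \gamma_t)$.

6: end for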



Step 5 of Algorithm 1 updates the penalties from $\theta_{t-1}$ to $\theta_t$, by increasing
$\theta(i)$ if the corresponding bin has been visited by the chain at the current
iteration, and by decreasing it otherwise. This rationale seems natural. However,
we did not find any article arguing for a particular choice of update, among
the infinite number of updates that would also follow the same rationale. In
other words, it is not obvious how to choose the function $f$, except that it
should be such that it is positive when $X_t \in \mathcal{X}_i$ and such that it is closer to
0 when $\gamma_t$ decreases, to ensure that the penalties converge. Some practitioners
use the following update:

(1) $\log\theta_t(i) \leftarrow \log\theta_{t-1}(i) + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)$

while others use:

(2) $\log\theta_t(i) \leftarrow \log\theta_{t-1}(i) + \log\left[1 + \gamma_t\left(\mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i\right)\right]$

Since $\gamma_t$ converges to 0 when $t$ increases, and since update (1) is the first-order
Taylor expansion of update (2), one legitimately expects both updates
to result in similar performance in practice. We shall see in Section 3 that
this is not necessarily the case.
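To make the Taylor expansion argument explicit, write $u = \mathbb{1}_{\mathcal{X}_i}(X_t) - \phi_i$, which is a shorthand introduced here for illustration. The increments of the two updates then compare as follows, so they differ by a term of order $\gamma_t^2$ only:

```latex
\log\left(1 + \gamma_t u\right)
  = \gamma_t u - \frac{\gamma_t^2 u^2}{2} + O\!\left(\gamma_t^3\right),
\qquad u \in \{1 - \phi_i,\; -\phi_i\}.
```

The difference vanishes as $\gamma_t \to 0$, but it does not vanish when the schedule is held at a fixed value $\gamma > 0$.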

Some convergence results have been proven about Algorithm 1: the deterministic
schedule ensures that $\theta_t$ changes less and less along the iterations
of the algorithm, and consequently the kernels $K_{\theta_t}$ change less and
less as well. The study of the algorithm hence falls into the realm of adaptive
MCMC where the diminishing adaptation condition holds (Andrieu and
Thoms, 2008; Atchade et al., 2009; Fort et al., 2011), although it is original
in the sense that the target distribution $(\pi_{\theta_t})$ is adaptive but not necessarily
the proposal distribution $Q$. See also the literature on stochastic approximation
(Andrieu, Moulines and Priouret, 2006).

In this article we are especially interested in a more sophisticated ver- 
sion of the Wang-Landau algorithm that uses a stochastic schedule, and for 
which, as we shall see in the following, the two updates result in different 
performance. 

2.2. A sophisticated version with stochastic schedule. A remarkable improvement
has been made over Algorithm 1: the use of a "Flat Histogram"
(FH) criterion to decrease the schedule only at certain random times. Let
us introduce $\nu_t(i)$, the number of generated points at iteration $t$ that are in
$\mathcal{X}_i$:

$$\nu_t(i) := \sum_{n=1}^{t} \mathbb{1}_{\mathcal{X}_i}(X_n)$$

For some predefined precision threshold $c$, we say that (FH) is met at iteration
$t$ if:

$$\max_{i \in \{1, \ldots, d\}} \left| \frac{\nu_t(i)}{t} - \phi_i \right| < c$$

Intuitively, this criterion is met if the observed proportion of visits to each
bin is not far from $\phi$, the desired proportion. The name "Flat Histogram"
comes from the observation that if the desired proportions are all equal to
$1/d$, this criterion is verified when the histogram of visits is approximately
flat. The threshold $c$ could possibly decrease along the iterations, to get an
always finer precision.
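The criterion can be sketched as a small helper. This is an illustration only: the function name is hypothetical, and it assumes that $t$ is taken to be the total of the supplied counts (for instance the counts accumulated since the last reset).

```python
def flat_histogram_met(counts, phi, c):
    """Return True when max_i |nu(i)/t - phi_i| < c, with t = sum(counts)."""
    t = sum(counts)
    if t == 0:
        return False  # no visits recorded yet, the proportions are undefined
    return max(abs(n / t - p) for n, p in zip(counts, phi)) < c
```

For example, with desired frequencies (0.75, 0.25) and $c$ = 1%, counts of (75, 25) satisfy the criterion while counts of (79, 21) do not.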

The Wang-Landau with Flat Histogram (Algorithm 2) is similar to the
previous algorithm, with a single difference: the schedule $\gamma$ does not decrease
at each step anymore, but only when (FH) is met. To know whether it is
met or not, a counter $\nu_t$ of visits to each bin is updated at each iteration,
and when (FH) is met, the schedule decreases and the counter is reset to 0.

Algorithm 2 Wang-Landau with Flat Histogram
1: Init $\forall i \in \{1, \ldots, d\}$ set $\theta_0(i) \leftarrow 1/d$.
2: Init $X_0 \in \mathcal{X}$.
3: Init $\kappa = 0$, the number of (FH) criteria already reached.
4: Init the counter $\forall i \in \{1, \ldots, d\}$ $\nu_0(i) \leftarrow 0$.
5: for $t = 1$ to $T$ do
6: Sample $X_t$ from $K_{\theta_{t-1}}(X_{t-1}, \cdot)$ targeting $\pi_{\theta_{t-1}}$.
7: Update $\nu_t$: $\forall i \in \{1, \ldots, d\}$ $\nu_t(i) \leftarrow \nu_{t-1}(i) + \mathbb{1}_{\mathcal{X}_i}(X_t)$
8: Check whether (FH) is met.
9: if (FH) is met then
10: $\kappa \leftarrow \kappa + 1$
11: $\forall i \in \{1, \ldots, d\}$ $\nu_t(i) \leftarrow 0$
12: end if
13: Update the bias: $\log\theta_t(i) \leftarrow \log\theta_{t-1}(i) + f(\mathbb{1}_{\mathcal{X}_i}(X_t), \phi_i, \gamma_\kappa)$.
14: end for

Note the difference between Algorithms 1 and 2: $\gamma$ is indexed by $\kappa$ instead
of $t$, and $\kappa$ is a random variable. As with Algorithm 1, the update of penalties
(step 13 of Algorithm 2) can be either update (1) or update (2), or possibly
something else. Interestingly in this case, it is not obvious anymore that
both updates will give similar results. Indeed, for $\gamma_\kappa$ to go to 0, we need
(FH) to be reached in finite time, so that $\kappa$ regularly increases.

This flavour of the Wang-Landau algorithm is widely used in the Physics
literature (Cunha Netto et al., 2006; Silva, Caparica and Plascak, 2006;
Malakis, Kalozoumis and Tyraskis, 2006; Ngo and Diep, 2008).

Our contribution is to show in a simple context that update (1) is such
that (FH) is met in finite time, while (2) is not so. Hence only using update
(1) can one expect the convergence properties of Algorithm 1 to still hold
for Algorithm 2, since if (FH) is met in finite time a sort of diminishing
adaptation condition would still hold.
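The steps of Algorithm 2 can be sketched in a few lines. This is an illustrative sketch rather than a reference implementation, and several choices below are assumptions made for the example: a toy target (standard normal restricted to $[-10, 10]$) with two bins $[-10, 0]$ and $]0, 10]$, a symmetric Gaussian random walk proposal, update (1) for the penalties, a schedule $\gamma_\kappa = 1/(1+\kappa)$, and the (FH) proportions computed over the iterations since the last reset. The function name `wang_landau_fh` is hypothetical.

```python
import math
import random


def wang_landau_fh(T, phi=(0.75, 0.25), c=0.05, step=1.0, seed=0):
    """Sketch of Algorithm 2 with update (1) on a two-bin toy target."""
    rng = random.Random(seed)
    d = len(phi)
    log_theta = [math.log(1.0 / d)] * d   # step 1
    x = 0.5                               # step 2, arbitrary starting point
    kappa = 0                             # step 3
    counts = [0] * d                      # step 4, counter reset at each (FH)
    visits = [0] * d                      # total visits, kept only for reporting
    since_reset = 0

    def log_pi(u):                        # unnormalised log-density (assumed target)
        return -0.5 * u * u

    def J(u):                             # 0-based bin index (assumed bins)
        return 0 if u <= 0.0 else 1

    for _ in range(T):
        # step 6: one MH move targeting pi_theta; the proposal is symmetric
        y = x + rng.gauss(0.0, step)
        if -10.0 <= y <= 10.0:
            log_ratio = (log_pi(y) - log_theta[J(y)]) - (log_pi(x) - log_theta[J(x)])
            if math.log(1.0 - rng.random()) < log_ratio:
                x = y
        i = J(x)
        counts[i] += 1                    # step 7
        visits[i] += 1
        since_reset += 1
        # steps 8 to 12: check (FH), then move along the schedule and reset
        if max(abs(counts[j] / since_reset - phi[j]) for j in range(d)) < c:
            kappa += 1
            counts = [0] * d
            since_reset = 0
        gamma = 1.0 / (1.0 + kappa)       # assumed schedule gamma_kappa
        # step 13 with update (1)
        for j in range(d):
            log_theta[j] += gamma * ((1.0 if j == i else 0.0) - phi[j])
    return [v / T for v in visits], kappa
```

Calling `wang_landau_fh(20000)` returns the overall proportions of visits to the two bins and the number of times (FH) was met. Replacing the last loop by the logarithmic increment of update (2) gives the other flavour discussed in the text.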

To underline the difficulty of knowing whether (FH) is met in finite time
or not, let us recall that between two (FH) occurrences, the schedule is constant
(equal to some $\gamma_\kappa > 0$), hence the penalties $(\theta_t)$ change at a constant
scale and diminishing adaptation does not directly hold. Other adaptive algorithms
share this lack of diminishing adaptation, as e.g. the Accelerated
Stochastic Approximation algorithm (Kesten, 1958), in which the adaptation
of some process $(A_t)$ diminishes only if its increments change sign. In
our case, (FH) will be reached if the chain $(X_t)$ lands with frequency $\phi_i$ in
each bin $\mathcal{X}_i$ (see Corollary 2).

Note that in the implementation of the algorithm, the penalties $\theta_t$ need
only be defined up to a normalizing constant, since they only appear in
ratios of the form $\theta_t(i)/\theta_t(j)$. We therefore introduce the following notation:

$$\forall i, j \in \{1, \ldots, d\} \text{ such that } i \neq j \quad Z_t^{(i,j)} = \log\frac{\theta_t(i)}{\theta_t(j)}$$

and we denote by $Z_t$ the collection of all the $Z_t^{(i,j)}$. Some intuition behind the
study of such ratios comes from considering update (1). With this update,
assume that for each $i$, $\mathbb{E}[\mathbb{1}_{\mathcal{X}_i}(X_t)] = \phi_i$. Then we could easily check that
for each pair $(i, j)$, $\mathbb{E}[Z_{t+1}^{(i,j)} \mid Z_t^{(i,j)}] = Z_t^{(i,j)}$, so this process would be constant
on average. The remainder of this paper hinges on two facts: that we can
control $(Z_t)$, in the sense that $Z_t^{(i,j)}/t \to 0$; and that if we control $(Z_t)$, then
we control the frequencies of visits $(\nu_t/t)$.



imsart-aap ver. 2011/11/15 file: FlatHistogramInFiniteTime.tex date: August 29, 2012 



WANG-LANDAU FLAT HISTOGRAM IN FINITE TIME 






More generally, notice that with fixed $\gamma$, the pair $(X_t, Z_t)$ forms a homogeneous
Markov chain. If we could prove that this chain is irreducible,
then it would imply that its proportion of visits to the set $\mathcal{X}_i \times \mathbb{R}^{d(d-1)}$
converges to some value in $[0,1]$. We would then need to check that the
limit is indeed the desired frequency $\phi_i$ for all $i$. Unfortunately, properties
of the joint chain $(X_t, Z_t)$ are difficult to establish due to the complexity of
its transition kernel. Finding a so-called drift function for the joint Markov
chain is also typically difficult. In general, we are not able to show that the
chain is irreducible. In Section 3, we prove directly that $Z_t^{(1,2)}/t \to 0$ in the
special case $d = 2$, under some assumptions. In Section 4, we make more
restrictive assumptions which imply irreducibility. In both cases, we show
the implication of this convergence on the frequencies of visits.

3. Proof when d=2. In the following we consider a simple context
with only two bins: $d = 2$ and $X_t$ can therefore only be either in $\mathcal{X}_1$ or in
$\mathcal{X}_2$. Suppose the current schedule is at $\gamma > 0$, and we want to know whether
(FH) is going to be met in finite time (hence $\gamma$ is fixed here). To simplify
notation, in this section we note

$$Z_t = Z_t^{(1,2)} = \log\theta_t(1) - \log\theta_t(2)$$

Using the definition of the penalties $(\theta_t)$ and of the counts $(\nu_t)$, we obtain

$$\begin{aligned}
Z_t &= Z_0 + \left[\nu_t(1) f(1, \phi_1, \gamma) + (t - \nu_t(1)) f(0, \phi_1, \gamma)\right] \\
&\quad - \left[\nu_t(2) f(1, \phi_2, \gamma) + (t - \nu_t(2)) f(0, \phi_2, \gamma)\right] \\
&= Z_0 + \nu_t(1)\left[f(1, \phi_1, \gamma) - f(0, \phi_1, \gamma)\right] + t f(0, \phi_1, \gamma) \\
&\quad - \left[(t - \nu_t(1)) f(1, \phi_2, \gamma) + \nu_t(1) f(0, \phi_2, \gamma)\right] \\
&= Z_0 + \nu_t(1)\left[f(1, \phi_1, \gamma) - f(0, \phi_1, \gamma) + f(1, \phi_2, \gamma) - f(0, \phi_2, \gamma)\right] \\
&\quad + t\left(f(0, \phi_1, \gamma) - f(1, \phi_2, \gamma)\right)
\end{aligned}$$

If we prove that $Z_t/t$ goes to 0 (for instance in mean), this will imply the
following convergence of the proportion of visits:

$$\frac{\nu_t(1)}{t} \xrightarrow[t\to\infty]{} \frac{f(1, \phi_2, \gamma) - f(0, \phi_1, \gamma)}{f(1, \phi_1, \gamma) - f(0, \phi_1, \gamma) + f(1, \phi_2, \gamma) - f(0, \phi_2, \gamma)}$$

(also in mean). Since we want (FH) to be reached in finite time for any
precision threshold $c > 0$, we need the proportions of visits to $\mathcal{X}_1$ to converge
to $\phi_1$. Hence we want:

(3) $\dfrac{f(1, \phi_2, \gamma) - f(0, \phi_1, \gamma)}{f(1, \phi_1, \gamma) - f(0, \phi_1, \gamma) + f(1, \phi_2, \gamma) - f(0, \phi_2, \gamma)} = \phi_1$

Using the specific forms of $f(\mathbb{1}_{\mathcal{X}_i}(X_t), \phi_i, \gamma)$ for both updates, we can easily
see that

• update (1) satisfies equation (3) for any $\phi$ and $\gamma$;

• in general, update (2) does not satisfy equation (3), except in the
special case where $\phi_1 = \phi_2 = 1/2$.
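The two claims above can be checked by direct substitution. The short computation below is a worked verification using only the definitions of updates (1) and (2) and the constraint $\phi_1 + \phi_2 = 1$:

```latex
% Update (1): f(1,\phi,\gamma) = \gamma(1-\phi), \quad f(0,\phi,\gamma) = -\gamma\phi
\frac{\gamma(1-\phi_2) + \gamma\phi_1}
     {\gamma(1-\phi_1) + \gamma\phi_1 + \gamma(1-\phi_2) + \gamma\phi_2}
  = \frac{2\gamma\phi_1}{2\gamma} = \phi_1 .

% Update (2): f(1,\phi,\gamma) = \log(1+\gamma(1-\phi)), \quad f(0,\phi,\gamma) = \log(1-\gamma\phi)
\frac{\log\dfrac{1+\gamma\phi_1}{1-\gamma\phi_1}}
     {\log\dfrac{1+\gamma\phi_2}{1-\gamma\phi_1} + \log\dfrac{1+\gamma\phi_1}{1-\gamma\phi_2}} ,
\quad \text{which equals } \tfrac{1}{2} \text{ when } \phi_1 = \phi_2 = \tfrac{1}{2}.
```

For update (2) the expression requires $\gamma\phi_i < 1$ so that the logarithms are defined.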

The rest of the paper is devoted to the proof that $Z_t/t$ goes to 0 under
some assumptions. More formally, we state in Theorem 1 what we shall prove
in the remainder of this section. This theorem holds for both updates.

Theorem 1. Consider the sequence of penalties $(\theta_t)$ introduced in Algorithm 2.
We define:

$$Z_t = \log\theta_t(1) - \log\theta_t(2)$$

Then:

$$\frac{Z_t}{t} \xrightarrow[t\to\infty]{L^1} 0$$

As a consequence, the long run proportion of visits to each bin converges
to the desired frequency $\phi$ for update (1), and not necessarily for update
(2). Corollary 2 clarifies the consequence of Theorem 1 on the validity of
Algorithm 2.

Corollary 2. When the proportions of visits converge in mean to the 
desired proportions, the Flat Histogram criterion is reached in finite time for 
any precision threshold c. 

We already made the simplification of considering the simple case d = 2. 
We make the following assumptions: 

Assumption 1. The bins are not empty with respect to $\mu$ and $\pi$:

$$\forall i \in \{1, 2\} \quad \mu(\mathcal{X}_i) > 0 \text{ and } \pi(\mathcal{X}_i) > 0$$

Assumption 2. The state space $\mathcal{X}$ is compact.

Assumption 3. The proposal distribution $Q(x,y)$ is such that:

$$\exists q_{min} > 0 \quad \forall x \in \mathcal{X} \quad \forall y \in \mathcal{X} \quad Q(x, y) > q_{min}$$

Assumption 4. The MH acceptance ratio is bounded from both sides:

$$\exists m > 0 \quad \exists M > 0 \quad \forall x \in \mathcal{X} \quad \forall y \in \mathcal{X} \quad m < \frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)} < M$$

Assumption 1 guarantees that the bins are well designed, and if it was
not verified, the algorithm would never reach (FH), regardless of the other
assumptions. Assumptions 2-4 are for example verified by a Gaussian random
walk proposal over a compact space, where there is a lower bound on $\pi$. We
believe that these assumptions can be relaxed to cover the most general
Wang-Landau algorithm. Making these four assumptions allows us to give
a clearer proof, and we propose hints on how to relax them in Section 6.

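We denote by $U_t$ the increment of $Z_t$, such that for any $t$:

$$Z_{t+1} = Z_t + U_t = Z_t + f(\mathbb{1}_{\mathcal{X}_1}(X_t), \phi_1, \gamma) - f(\mathbb{1}_{\mathcal{X}_2}(X_t), \phi_2, \gamma)$$

Here with only two bins, the increments $U_t$ can take two different values,
$+a$ or $-b$, for some $a > 0$ and $b > 0$ that depend on $\phi$ and $\gamma$. For example,
with update (1):

$$a = 2\gamma(1 - \phi_1) > 0, \qquad b = 2\gamma\phi_1 > 0$$

whereas with update (2):

$$a = \log\frac{1 + \gamma(1 - \phi_1)}{1 - \gamma\phi_2} > 0, \qquad b = \log\frac{1 + \gamma(1 - \phi_2)}{1 - \gamma\phi_1} > 0$$

and in both cases, if $X_t \in \mathcal{X}_1$ then $U_t = +a$, otherwise $U_t = -b$.

We want to prove that $Z_t/t$ goes to 0, and we are going to prove a stronger
result that states, in words, that when $Z_t$ leaves a fixed interval $[Z^{lo}, Z^{hi}]$,
it returns to it in a finite time.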

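3.1. Behaviour of $(Z_t)$ outside an interval. First, Lemma 3 states that if
$Z_t$ goes above a value $Z^{hi}$, it has a strictly positive probability of starting
to decrease, and that when that happens, it keeps on decreasing with a high
probability.

Lemma 3. With the introduced processes $Z_t$ and $U_t$, there exists $\epsilon > 0$
such that for all $\eta > 0$, there exists $Z^{hi}$ such that, if $Z_t \geq Z^{hi}$, we have the
following two inequalities:

$$P[U_{t+1} = -b \mid U_t = +a, Z_t] > \epsilon$$
$$P[U_{t+1} = -b \mid U_t = -b, Z_t] > 1 - \eta.$$

Proof of Lemma 3. We start with the first inequality. Let $q_{min}$ be as
in Assumption 3.
In terms of events, $\{U_t = +a\}$ is equivalent to $\{X_t \in \mathcal{X}_1\}$, by definition. If
$X_t \in \mathcal{X}_1$ and $\pi(X_t) > 0$ then:

$$\begin{aligned}
K_{\theta_t}(X_t, \mathcal{X}_2) &= \int_{\mathcal{X}_2} K_{\theta_t}(X_t, y)\,dy \\
&= \int_{\mathcal{X}_2} Q(X_t, y)\rho_{\theta_t}(X_t, y)\,dy \\
&= \int_{\mathcal{X}_2} Q(X_t, y)\left(1 \wedge \frac{\pi(y)}{\pi(X_t)}\frac{Q(y, X_t)}{Q(X_t, y)}\frac{\theta_t(J(X_t))}{\theta_t(J(y))}\right)dy \\
&= \int_{\mathcal{X}_2} Q(X_t, y)\left(1 \wedge \frac{\pi(y)}{\pi(X_t)}\frac{Q(y, X_t)}{Q(X_t, y)}e^{Z_t}\right)dy
\end{aligned}$$

Using Assumption 4, $\frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)}$ is bounded from below, hence there exists
$K_1$ such that:

$$\forall x \in \mathcal{X}_1 \quad \forall y \in \mathcal{X}_2 \quad \frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)}e^{K_1} > 1$$

If $Z_t \geq K_1$ and $X_t \in \mathcal{X}_1$, then:

$$K_{\theta_t}(X_t, \mathcal{X}_2) = \int_{\mathcal{X}_2} Q(X_t, y)\,dy > q_{min}\,\mu(\mathcal{X}_2).$$

Hence if $Z_t \geq K_1$:

$$P[U_{t+1} = -b \mid U_t = +a, Z_t] = P[X_{t+1} \in \mathcal{X}_2 \mid X_t \in \mathcal{X}_1, Z_t] > q_{min}\,\mu(\mathcal{X}_2).$$

We now prove the second inequality. Let us show that for any $\eta > 0$ there
exists $K_3$ such that, provided $Z_t \geq K_3$:

$$P[U_{t+1} = -b \mid U_t = -b, Z_t] > 1 - \eta.$$

We have

$$P[U_{t+1} = -b \mid U_t = -b, Z_t] = P[X_{t+1} \in \mathcal{X}_2 \mid X_t \in \mathcal{X}_2, Z_t].$$

Again let us first work for a fixed $X_t \in \mathcal{X}_2$.

$$\begin{aligned}
K_{\theta_t}(X_t, \mathcal{X}_2) &= 1 - K_{\theta_t}(X_t, \mathcal{X}_1) \\
&= 1 - \int_{\mathcal{X}_1} Q(X_t, y)\rho_{\theta_t}(X_t, y)\,dy \\
&= 1 - \int_{\mathcal{X}_1} Q(X_t, y)\left(1 \wedge \frac{\pi(y)}{\pi(X_t)}\frac{Q(y, X_t)}{Q(X_t, y)}e^{-Z_t}\right)dy
\end{aligned}$$

Using the assumption that the MH ratio is bounded from above,
there exists $K_2$ such that:

$$\forall x \in \mathcal{X}_2 \quad \forall y \in \mathcal{X}_1 \quad \frac{\pi(y)}{\pi(x)}\frac{Q(y,x)}{Q(x,y)}e^{-K_2} < 1$$

And hence for $Z_t \geq K_2$:

$$\begin{aligned}
K_{\theta_t}(X_t, \mathcal{X}_2) &\geq 1 - e^{-Z_t}\int_{\mathcal{X}_1} Q(X_t, y) M\,dy \\
&\geq 1 - M e^{-Z_t}
\end{aligned}$$

and hence for any $\eta$, there is a $K_3$ greater than $K_2$ such that for all $Z_t \geq K_3$:

$$K_{\theta_t}(X_t, \mathcal{X}_2) > 1 - \eta$$

We thus obtain:

$$P[U_{t+1} = -b \mid U_t = -b, Z_t] > 1 - \eta.$$

To conclude we finally define $\epsilon = q_{min}\,\mu(\mathcal{X}_2)$ and then for any $\eta > 0$, by
taking any $Z^{hi}$ greater than $K_1 \vee K_3$ we have both inequalities. □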

Considering the symmetry of the problem, we instantly have the following
corollary result. It states that if $Z_t$ goes too low, it has a strictly positive
probability of starting to increase, and when that happens, it keeps on increasing
with a high probability.

Lemma 4. With the introduced processes $Z_t$ and $U_t$, there exists $\epsilon > 0$
such that for all $\eta > 0$, there exists $Z^{lo}$ such that, if $Z_t \leq Z^{lo}$, we have the
following two inequalities:

$$P[U_{t+1} = +a \mid U_t = -b, Z_t] > \epsilon$$
$$P[U_{t+1} = +a \mid U_t = +a, Z_t] > 1 - \eta.$$

3.2. A new process that bounds $(Z_t)$ outside the set. In this section, the
proof introduces a new sequence of increments $\tilde{U}_t$ that bounds $U_t$, and such
that the sequence $\tilde{Z}_t$ using $\tilde{U}_t$ as increments:

$$\tilde{Z}_{t+1} = \tilde{Z}_t + \tilde{U}_t$$

returns to $[Z^{lo}, Z^{hi}]$ in a finite time whenever it leaves it. It will imply that
$Z_t$ also returns to $[Z^{lo}, Z^{hi}]$ in finite time whenever it leaves it. Figure 1
might help to visualize the proof.

Fig 1. Trajectory of $Z$ (full line with dots) and of $\tilde{Z}$ (dotted line), when these processes
go above some level $Z^{hi}$ indicated by a horizontal full line. $Z$ goes above the level at time
$s$, and returns below it at time $s + T$, whereas $\tilde{Z}$ stays above the level until time $s + \tilde{T}$,
with $T \leq \tilde{T}$.

First let us use Lemma 3. We can take $\epsilon < 1/2$ and $\eta < \min(1/2, \epsilon b/a)$.
The Lemma gives the existence of an integer $K$ such that if $Z_t \geq K$ we
have the following two inequalities:

(4) $P[U_{t+1} = -b \mid U_t = +a, Z_t] > \epsilon$

(5) $P[U_{t+1} = -b \mid U_t = -b, Z_t] > 1 - \eta.$

Suppose that there is some time $s$ such that $Z_{s-1} < K$ and $Z_s \geq K$. Note
that necessarily $Z_s \in [K, K + a]$. Then we define $\tilde{Z}_s = Z_s$, a new process
starting at time $s$. Let $s + T$ be the first time after $s$ such that $Z_{s+T} < K$.
We wish to show that $\mathbb{E}[T] < \infty$.

We define the sequence of random variables $(\tilde{Z}_t)_{t \geq s}$ by $\tilde{Z}_s = Z_s$
and $\tilde{Z}_{t+1} = \tilde{Z}_t + \tilde{U}_t$ for $t \geq s$, where $(\tilde{U}_t)$ is a sequence of random
variables taking the values $+a$ or $-b$.
For $s \leq t \leq T$, $\tilde{U}_t$ is defined as follows:

• if $U_{t+1} = +a$ then $\tilde{U}_{t+1} = +a$;

• if $U_{t+1} = -b$, $U_t = -b$ and $\tilde{U}_t = -b$ then $\tilde{U}_{t+1} = -b$ with probability
$p_1 = (1 - \eta)/P[U_{t+1} = -b \mid U_t = -b, Z_t]$ and $\tilde{U}_{t+1} = +a$ otherwise;

• if $U_{t+1} = -b$, $U_t = +a$ and $\tilde{U}_t = +a$, then $\tilde{U}_{t+1} = -b$ with probability
$p_2 = \epsilon/P[U_{t+1} = -b \mid U_t = +a, Z_t]$ and $\tilde{U}_{t+1} = +a$ otherwise;

• if $U_{t+1} = -b$, $U_t = -b$ and $\tilde{U}_t = +a$, then $\tilde{U}_{t+1} = -b$ with probability
$p_3 = \epsilon\left(1 + P[U_{t+1} = +a \mid U_t = -b, Z_t]/P[U_{t+1} = -b \mid U_t = -b, Z_t]\right)$ and
$\tilde{U}_{t+1} = +a$ otherwise.

For times $t > T$, $\tilde{U}_t$ is a Markov chain independent of $U_t$ and $Z_t$, with
transition matrix

$$\begin{pmatrix} 1 - \epsilon & \epsilon \\ \eta & 1 - \eta \end{pmatrix}$$

where the first state corresponds to $+a$ and the second state to $-b$.

First, let us check that all these probabilities are indeed less than 1. For
$p_1$, it follows from inequality (5). For $p_2$, it follows from inequality (4). For
$p_3$, we have

$$p_3 = \epsilon\left(1 + \frac{P[U_{t+1} = +a \mid U_t = -b, Z_t]}{P[U_{t+1} = -b \mid U_t = -b, Z_t]}\right) \leq \epsilon\left(1 + \frac{\eta}{1 - \eta}\right) = \frac{\epsilon}{1 - \eta} \leq 1$$

where we used the conditions $\eta < 1/2$ and $\epsilon < 1/2$. Hence $(\tilde{U}_t)$ is well
defined.

Lemma 5. $(\tilde{U}_t)$ is a Markov chain over the space $\{+a, -b\}$ with transition
matrix

$$\begin{pmatrix} 1 - \epsilon & \epsilon \\ \eta & 1 - \eta \end{pmatrix}$$

where the first state corresponds to $\{+a\}$ and the second state to $\{-b\}$.

Proof of Lemma 5. We only need to check this for times $t \leq T$. The
events $\{\tilde{U}_t = -b\}$ and $\{\tilde{U}_t = -b, U_t = -b\}$ are identical, hence:

$$\begin{aligned}
P[\tilde{U}_{t+1} = -b \mid \tilde{U}_t = -b, Z_t] &= P[\tilde{U}_{t+1} = -b \mid \tilde{U}_t = -b, U_t = -b, Z_t] \\
&= P[\tilde{U}_{t+1} = -b \mid U_{t+1} = -b, \tilde{U}_t = -b, U_t = -b, Z_t] \\
&\quad \times P[U_{t+1} = -b \mid \tilde{U}_t = -b, U_t = -b, Z_t] \\
&= \frac{(1 - \eta)\,P[U_{t+1} = -b \mid U_t = -b, Z_t]}{P[U_{t+1} = -b \mid U_t = -b, Z_t]} \\
&= 1 - \eta.
\end{aligned}$$

Note that this does not depend on $Z_t$.
Similarly:

$$\begin{aligned}
P[\tilde{U}_{t+1} = -b \mid \tilde{U}_t = +a, U_t = +a, Z_t]
&= P[\tilde{U}_{t+1} = -b \mid U_{t+1} = -b, \tilde{U}_t = +a, U_t = +a, Z_t] \\
&\quad \times P[U_{t+1} = -b \mid \tilde{U}_t = +a, U_t = +a, Z_t] \\
&= \frac{\epsilon\,P[U_{t+1} = -b \mid U_t = +a, Z_t]}{P[U_{t+1} = -b \mid U_t = +a, Z_t]} \\
&= \epsilon.
\end{aligned}$$

And:

$$\begin{aligned}
P[\tilde{U}_{t+1} = -b \mid \tilde{U}_t = +a, U_t = -b, Z_t]
&= P[\tilde{U}_{t+1} = -b \mid U_{t+1} = -b, \tilde{U}_t = +a, U_t = -b, Z_t] \\
&\quad \times P[U_{t+1} = -b \mid \tilde{U}_t = +a, U_t = -b, Z_t] \\
&= \epsilon\left(1 + \frac{P[U_{t+1} = +a \mid U_t = -b, Z_t]}{P[U_{t+1} = -b \mid U_t = -b, Z_t]}\right) \times P[U_{t+1} = -b \mid U_t = -b, Z_t] \\
&= \epsilon\left(P[U_{t+1} = -b \mid U_t = -b, Z_t] + P[U_{t+1} = +a \mid U_t = -b, Z_t]\right) \\
&= \epsilon.
\end{aligned}$$

These last two calculations result in:

$$P[\tilde{U}_{t+1} = -b \mid \tilde{U}_t = +a] = \epsilon$$

with no dependence on $Z_t$ (or $U_t$). □

The previous lemma is central to the proof, and especially the lack of
dependence on $Z_t$. We always have $\tilde{U}_s = +a$, since $U_s = +a$. Hence for each
$t \geq s$, the distribution of $\tilde{U}_t$ depends only on $\eta$ and $\epsilon$, and implicitly on the
threshold $K$, but not on the value of $Z_s$. Hence $(\tilde{U}_t)$ has the same law, every
time the process $(Z_t)$ goes above $K$.

3.3. Conclusion: proof of Theorem 1 and Corollary 2. Let us now use
the bounding process $(\tilde{Z}_t)$ to control the time spent by $(Z_t)$ above $K$.

Lemma 6. There exists $\tau \in \mathbb{R}$ such that, for all times $s$ such that $Z_{s-1} <
K$ and $Z_s \geq K$, and defining $T$ by $T = \inf_{d \geq 0}\{Z_{s+d} < K\}$, then:

$$\mathbb{E}[T] \leq \tau$$

Proof of Lemma 6. The Markov chain $(\tilde{U}_t)$ admits the following stationary
distribution:

$$\tilde{\pi} = \left(\frac{\eta}{\epsilon + \eta}, \frac{\epsilon}{\epsilon + \eta}\right)$$

Let us denote by $\tilde{T}$ the time spent by $(\tilde{Z}_t)$ over $K$, that is:

$$\tilde{T} = \inf_{d \geq 0}\left\{\sum_{t=s+1}^{s+d} \tilde{U}_t < -a\right\}$$

Remember that $\tilde{Z}_s = Z_s \in [K, K + a]$, hence $\tilde{Z}_{s+\tilde{T}} < K$ (whatever the value
of $Z_s$). Now, our choice of $\eta$ results in $a\eta < b\epsilon$ which implies $\mathbb{E}[\tilde{T}] < \infty$
(Norris, 1998). Let $\tau = \mathbb{E}[\tilde{T}]$. Note that since the law of $(\tilde{U}_t)$ does not
depend on the value of $Z_s$, $\tau$ does not depend on $Z_s$.

Since, for $t \leq T$, we impose that "if $U_{t+1} = +a$ then $\tilde{U}_{t+1} = +a$", it
follows that $\forall t \leq T$, $U_t \leq \tilde{U}_t$. Consequently $\forall t \leq T$, $Z_t \leq \tilde{Z}_t$ and hence
$T \leq \tilde{T}$. Note that (the distribution of) $T$ depends on the exact value of $Z_s$,
but that $\tilde{T}$ as we have defined it has a fixed distribution. We have $\mathbb{E}[T] \leq \tau$
(whatever the value of $Z_s$). □
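The role of the condition $a\eta < b\epsilon$ can be made explicit with a one-line computation. Under the stationary distribution of the two-state chain, the mean increment of the bounding process is strictly negative exactly when that condition holds, which is what the choice $\eta < \epsilon b/a$ guarantees:

```latex
\mathbb{E}_{\tilde{\pi}}\big[\tilde{U}_t\big]
  = a\,\frac{\eta}{\epsilon + \eta} - b\,\frac{\epsilon}{\epsilon + \eta}
  = \frac{a\eta - b\epsilon}{\epsilon + \eta} < 0
  \iff a\eta < b\epsilon
  \iff \eta < \frac{\epsilon b}{a}.
```

A negative stationary drift is the intuition behind the finiteness of the expected return time; the rigorous statement is the one cited from Norris (1998).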

Proof of Theorem 1. Let us define the following sequence of indices:

$$S_1 = \inf_{s > 0}\{Z_{s-1} < K \text{ and } Z_s \geq K\}\,; \quad S_k = \inf_{s > S_{k-1}}\{Z_{s-1} < K \text{ and } Z_s \geq K\}$$

The sequence $(S_k)$ represents the times at which the process $(Z_t)$ goes above
$K$. Moreover let us introduce the sequence of times spent above $K$:

$$T_k = \inf_{d > 0}\{Z_{S_k+d-1} \geq K \text{ and } Z_{S_k+d} < K\}$$

We have $Z_{S_k} \in [K, K + a]$. Define $k(t)$ such that $S_{k(t)} \leq t < S_{k(t)+1}$. Either
$Z_t < K$ or $Z_t \geq K$. In the latter case, $Z_t \leq Z_{S_{k(t)}} + a T_{k(t)}$. Clearly in any
case:

(6) $\mathbb{E}[Z_t] \leq (K + a) + a\tau.$

A similar reasoning on the lower bound leads to $K'$ and $\tau' < \infty$ such
that

(7) $\mathbb{E}[Z_t] \geq (K' - b) - b\tau'.$

Inequalities (7) and (6) imply

$$\mathbb{E}\left[\left|\frac{Z_t}{t}\right|\right] \xrightarrow[t\to\infty]{} 0$$

□

As stated at the beginning of the section, for update (1) the convergence
$Z_t/t \to 0$ (in mean) implies the convergence of the proportions $(\nu_t/t)$ to $\phi$
(also in mean). We now show that this ensures that the Flat Histogram is
reached in finite time.

Proof of Corollary 2. For a fixed threshold $c$, recall that (FH) being
reached at time $t$ corresponds to the event:

$$FH_t = \left\{\forall i \in \{1, \ldots, d\} \quad \left|\frac{\nu_t(i)}{t} - \phi_i\right| < c\right\}$$

We will only use the convergence in probability of the proportions to $\phi_i$ for
all $i$:

$$\frac{\nu_t(i)}{t} \xrightarrow[t\to\infty]{\mathbb{P}} \phi_i$$

which implies:

$$\forall \epsilon > 0 \quad \exists N \in \mathbb{N} \quad \forall t \geq N \quad \mathbb{P}(FH_t) > 1 - \epsilon$$

We can hence define a stopping time $T^{FH}$ corresponding to the first (FH)
being reached:

$$T^{FH} = \inf_{t \geq 0}\{FH_t\}$$

and some $\epsilon > 0$ such that:

$$\exists N \in \mathbb{N} \quad \forall n \geq N \quad \mathbb{P}(T^{FH} \leq N + n) > \epsilon$$

Using Lemma 10.11 of Williams (1991), the expectation of $T^{FH}$ is then
finite. □

4. Proof when d > 2. In this section we extend the proof to the more
general case $d > 2$. Having proved that for $d = 2$, only update (1) is valid,
we now focus on this update and omit update (2).

We consider the log penalties defined for update (1) by:

$$\log\theta_t(i) = \nu_t(i)(1 - \phi_i) - (t - \nu_t(i))\phi_i = \nu_t(i) - t\phi_i$$

where $\nu_t(i)$ is the number of visits of $(X_t)$ in $\mathcal{X}_i$. We assume without loss of
generality that $\log\theta_0 = 0$. Then $(X_t, \log\theta_t)$ is a Markov chain, by definition
of the WL algorithm. We first prove that $(X_t, \log\theta_t)$ is $\lambda$-irreducible, for a
sigma-finite measure $\lambda$. We will require the following additional assumption
on the desired frequencies $\phi$.

Assumption 5. The desired frequencies are rational numbers:

$$\phi = (\phi_1, \ldots, \phi_d) \in \mathbb{Q}^d$$

Lemma 7. Let $\Theta$ be the following subset of $\mathbb{R}^d$:

$$\Theta = \left\{z \in \mathbb{R}^d : \exists (n_1, \ldots, n_d) \in \mathbb{N}^d \quad z_i = n_i - \phi_i S_n \text{ where } S_n = \sum_{j=1}^{d} n_j\right\}$$

Then denoting by $\lambda$ the product of the Lebesgue measure $\mu$ on $\mathcal{X}$ and of the
counting measure on $\Theta$, $(X_t, \log\theta_t)$ is $\lambda$-irreducible.

Proof. The proof essentially comes from Bezout's lemma, and is detailed
in the Appendix. Note however that it relies on Assumption 5, which was not
required for the case $d = 2$. This is not a very satisfying assumption, and it
is likely not necessary for proving the occurrence of (FH) in finite time, but
it seems to be necessary for the irreducibility of $(X_t, \log\theta_t)$, at least with
respect to a standard sigma-finite measure. In any case, this assumption is
not restrictive in practice. □

Since this chain is $\lambda$-irreducible, the proportion of visits to any $\lambda$-measurable
set of $\mathcal{X} \times \Theta$ converges to a limit in $[0, 1]$. This implies that the vector $(\nu_t(i)/t)$
converges to some vector $(p_i)$. The following is a reductio ad absurdum.

Suppose that for some $i \in \{1, \ldots, d\}$, $p_i \neq \phi_i$. Since the vectors $p$ and $\phi$
both sum to 1, this means that for some $i$, $p_i < \phi_i$: such a state $i$ is visited
less than the desired frequency.

Let $\{i_1, i_2, \ldots\} = \mathrm{argmin}_{1 \leq j \leq d}(p_j - \phi_j)$. Then for any $i_k$ and for $j \notin
\{i_1, i_2, \ldots\}$, we have

$$Z_t^{(j, i_k)} = -\nu_t(i_k) + \nu_t(j) + t(\phi_{i_k} - \phi_j) \sim t(-p_{i_k} + \phi_{i_k} + p_j - \phi_j) \to \infty$$

This implies:

$$\forall K > 0 \quad \exists T \in \mathbb{N} \quad \forall t \geq T \quad Z_t^{(j, i_k)} \geq K$$

Now consider the stochastic process $(U_t)$ such that

• $U_t = -b$ if $J(X_t) \in \{i_1, i_2, \ldots\}$

• $U_t = +a$ otherwise

for some real numbers $a$ and $b$. Recall that the function $J$ is such that if
$X_t \in \mathcal{X}_i$ then $J(X_t) = i$.

Let $\epsilon$ be such that when $X_t \notin \mathcal{X}_{i_1} \cup \mathcal{X}_{i_2} \cup \cdots$, there is probability at
least $\epsilon$ of proposing in $\mathcal{X}_{i_1} \cup \mathcal{X}_{i_2} \cup \cdots$. For large enough $K$, these proposals
will always be accepted. As before, for large enough $K$, we can make the
probability $\eta$ of leaving $\mathcal{X}_{i_1} \cup \mathcal{X}_{i_2} \cup \cdots$ as small as we wish.

Using the exact same reasoning as in Section 3, we can construct a process
$(\tilde{U}_t)$ which is a Markov chain with transition matrix

$$\begin{pmatrix} 1 - \epsilon & \epsilon \\ \eta & 1 - \eta \end{pmatrix}$$

and with $U_t \leq \tilde{U}_t$ almost surely. Therefore for $t \geq T$, the increments $(\tilde{U}_t)$ are
negative on average, hence $(Z_t^{(j, i_k)})$ decreases on average, which contradicts the
assumption that it goes to infinity. Hence for all $i$, $p_i = \phi_i$.

5. Illustration of Theorem 1 on a toy example. Let us show
the consequences of Theorem 1 on a simple example. We consider as the
target distribution the standard normal distribution truncated to the set
$\mathcal{X} = [-10, 10]$. We use a Gaussian random walk proposal, with unit standard
deviation. Finally we arbitrarily split the state space in $\mathcal{X}_1 = [-10, 0]$ and
$\mathcal{X}_2 = ]0, 10]$, and we set the desired frequencies to be $\phi = (0.75, 0.25)$. Figure
2 shows the results of the Wang-Landau algorithm. Using update (1) and
200,000 iterations, we obtain the histogram of Figure 2(a). Figure 2(b) shows
the convergence of the proportions of visits to each bin, using update (1).
The dotted horizontal lines indicate $\phi$, and we can check that the observed
proportions of visits converge towards it.

Figure 2(c) shows a similar plot, this time using update (2). Again, the
desired frequencies are represented by dotted lines. Using the left hand side
of equation (3), we can calculate the theoretical limit of the observed proportion
of visits in each bin, which for $\gamma = 1$ and $\phi = (0.75, 0.25)$, is approximately
equal to $(0.79, 0.21)$. Hence for a precision threshold $c$ equal to e.g.
1%, (FH) is unlikely to be met if one uses update (2).

As expected, update (1) leads to convergence to the desired frequencies
but update (2) does not.
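The theoretical limit quoted above can be reproduced numerically from the left hand side of equation (3). The snippet below is a small illustrative check; the function names are hypothetical, and the two increment functions are written directly from updates (1) and (2).

```python
import math


def limit_proportion_bin1(f, phi1, gamma):
    """Left hand side of equation (3) for two bins, with phi2 = 1 - phi1."""
    phi2 = 1.0 - phi1
    num = f(1, phi2, gamma) - f(0, phi1, gamma)
    den = (f(1, phi1, gamma) - f(0, phi1, gamma)
           + f(1, phi2, gamma) - f(0, phi2, gamma))
    return num / den


def f1(visited, phi, gamma):
    """Increment of update (1)."""
    return gamma * (visited - phi)


def f2(visited, phi, gamma):
    """Increment of update (2); requires gamma * phi < 1."""
    return math.log(1.0 + gamma * (visited - phi))


p1 = limit_proportion_bin1(f1, 0.75, 1.0)
p2 = limit_proportion_bin1(f2, 0.75, 1.0)
print(round(p1, 2), round(p2, 2))  # 0.75 0.79
```

With update (2) the limiting proportion for the first bin is about 0.79, which is 4 percentage points away from the desired 0.75 and therefore outside a 1% threshold.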

6. Discussion. As seen in Theorem 1 and Corollary 2 of Section 3,
update (1) is valid, in the sense that the frequencies of visits of the chain
$(X_t)$ converge towards $\phi$. Consequently (FH) is met in finite time, for any
threshold $c > 0$.

Regarding the proof of Theorem 1 in the case $d > 2$, we assume that the
desired frequencies $\phi$ are rational (Assumption 5), which allows us to prove
that the Markov chain generated by the algorithm $(X_t, Z_t)$ is $\lambda$-irreducible
for some sigma-finite measure $\lambda$. However, our proof requires mainly that the
proportions of visits of $(X_t)$ to any bin $\mathcal{X}_i$ converge, which is equivalent to
the convergence of $(Z_t/t)$. We believe that results on Random Walks in Random
Environments (Zeitouni, 2006) would allow the rationality
assumption to be removed.

[Figure 2, three panels: (a) Histogram of the generated sample; (b) Convergence of the
proportions of visits to each bin, using the right update (horizontal axis: iterations);
(c) Convergence of the proportions of visits to each bin, using the wrong update
(horizontal axis: iterations).]

Fig 2. Results of the Wang-Landau algorithm using two different updates of the penalties.
Histogram of the generated sample using update (1), with a vertical line showing the binning
(left). Convergence of the proportions of visits to each bin, using update (1) (middle) and
using update (2) (right). The dotted horizontal lines represent the desired frequencies.

Assumptions 2-4 could be relaxed by using well-known properties of the 
Metropolis-Hastings algorithm, which we did not take advantage of here. 
More precisely, note that the Wang-Landau transition kernel differs 
from the Metropolis-Hastings kernel only when the proposed point, generated 
through $Q(\cdot, \cdot)$, lands in a different bin than the current position 
of the chain. Otherwise, the kernel behaves like a Metropolis-Hastings kernel 
targeting $\pi$. Hence under weaker assumptions than the ones we have 
formulated here, it has recurrence properties. 
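
To make this remark concrete, assume (the notation here is ours and is not taken from this section) that for given penalties $\theta$ the Wang-Landau kernel is a Metropolis-Hastings kernel targeting $\pi_\theta(x) \propto \pi(x)/\theta(J(x))$, where $J(x)$ denotes the index of the bin containing $x$. Under that assumption the acceptance probability can be written as follows, and the penalty ratio cancels whenever the proposed point stays in the current bin:

```latex
\begin{align*}
\rho_\theta(x, y)
  &= 1 \wedge \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)}
     \cdot \frac{\theta(J(x))}{\theta(J(y))}
  && \text{in general,} \\
\rho_\theta(x, y)
  &= 1 \wedge \frac{\pi(y)\, Q(y, x)}{\pi(x)\, Q(x, y)}
  && \text{when } J(x) = J(y).
\end{align*}
```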

To conclude, we have shown that for fixed $\gamma$, the Flat Histogram 
criterion is reached in finite time for certain updates. For other updates, 
the observed frequencies do not converge to the desired frequencies, and so 
there is a non-zero probability that the Flat Histogram criterion will never 
be verified. Note that we do not make any claims about the distribution of 
the sample inside each of the bins $\mathcal{X}_i$ at fixed $\gamma$. 

Acknowledgements. The authors thank Luke Bornn, Arnaud Doucet, Eric 

APPENDIX A: PROOF OF LEMMA 6 

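Let $\Theta \subset \mathbb{R}^d$ be the set of possibly reachable values of 
the process $(\log \theta_t)$. We define it by: 

$$\Theta = \Big\{ z \in \mathbb{R}^d : \exists (n_1, \ldots, n_d) \in \mathbb{N}^d \quad \forall i \quad z_i = n_i - \phi_i S_n \ \text{ where } S_n = \sum_{j=1}^d n_j \Big\}$$ 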

We want to prove the existence of a measure $\lambda$ on 
$\mathcal{X} \times \Theta$ such that the Markov chain 
$(X_t, \log \theta_t)$ is $\lambda$-irreducible. Denote by $\mu$ the Lebesgue 
measure on $\mathcal{X}$, let $A \in \mathcal{B}(\mathcal{X})$ be such that 
$\mu(A) > 0$, and let $z^* \in \Theta$. Let us show that for any time $s$ at 
which $X_s = x_s \in \mathcal{X}$ and $\log \theta_s = z_s \in \Theta$, there 
exists $t > 0$ such that $X_{s+t} \in A$ and $\log \theta_{s+t} = z^*$ with 
strictly positive probability. This will prove the $\lambda$-irreducibility of 
$(X_t, \log \theta_t)$, where $\lambda$ is the product of the Lebesgue measure 
$\mu$ on $\mathcal{X}$ and the counting measure on $\Theta$. 

Note first that for any $n = (n_1, \ldots, n_d) \in \mathbb{N}^d$, the process 
$(X_t)$ can visit each set $\mathcal{X}_i$ exactly $n_i$ times (for all $i$) 
between some time $s + 1$ and some time $s + \sum_{i=1}^d n_i$, since there is 
always a non-zero probability of $X_{t+1}$ visiting any $\mathcal{X}_i$ given 
$X_t$ and $\log \theta_t$ (using Assumption 3 on the proposal distribution 
and the form of the MH kernel). More formally, given any 
$n \in \mathbb{N}^d$ and any time $s$, denoting $S_n = \sum_{i=1}^d n_i$: 



Furthermore since $\mu(A) > 0$ and since $(\mathcal{X}_i)_{i=1}^d$ is a 
partition of $\mathcal{X}$ (satisfying Assumption 1 on non-empty bins), there 
exists $B \subset A$ such that $B \subset \mathcal{X}_{i^*}$ for some 
$i^* \in \{1, \ldots, d\}$ and $\mu(B) > 0$. We are going to prove the 
following statement, which means that there is a "path" between any pair 
of points in $\Theta$: 

Lemma 8. 



Then we will conclude as follows: the Markov chain can go from any 
$(x_s, z_s)$ to some $(x_{s+t-1}, z_{s+t-1})$ where $z_{s+t-1}$ can be anywhere 
in $\Theta$, and then in one final step to $(x_{s+t}, z_{s+t})$ such that 
$x_{s+t} \in B$ and $z_{s+t} = z^*$, since $z_{s+t-1}$ can be chosen such that 
$z_{s+t} = z^*$ when $x_{s+t} \in B \subset \mathcal{X}_{i^*}$. 

Proof of Lemma 8. The structure of the proof is the following: we 
prove that $(\log \theta_t)$ can go from 0 to 0, then from any 
$z \in \Theta$ to 0, and the possibility of going from 0 to any 
$z \in \Theta$ comes from the definition of $\Theta$. 

Suppose that $\log \theta_0 = (0, \ldots, 0)$, and let us prove that the 
process $(\log \theta_t)$ can go back to 0, i.e. let us find a vector 
$n = (n_1, \ldots, n_d) \in \mathbb{N}^d$ such that 

$$\forall i \in \{1, \ldots, d\} \quad n_i = \phi_i S_n \quad \text{where } S_n = \sum_{j=1}^d n_j$$ 
Under the rationality assumption on $\phi$ (Assumption 5), there exist 
$(a_1, \ldots, a_d) \in \mathbb{N}^d$ and $b \in \mathbb{N}$ such that 
$\phi_i = a_i / b$ for all $i$. Now define $n \in \mathbb{N}^d$ as follows: 



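$$\forall i \in \{1, \ldots, d\} \quad n_i = k \, \frac{a_i}{b}$$ 

where $k \in \mathbb{N}$ is such that $n_i \in \mathbb{N}$ for all $i$. Then, 
using $\sum_{j=1}^d a_j = b$, one can readily check that: 

$$\forall i \in \{1, \ldots, d\} \quad n_i = \phi_i S_n$$ 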




Hence the vector $n$ defines a possible path for $(\log \theta_t)$ between 0 
and 0, in $S_n = \sum_{j=1}^d n_j$ steps, with a strictly positive probability 
(using Equation (8)). 

A similar reasoning allows us to find a possible path from any 
$z \in \Theta$ to 0. For such a $z \in \Theta$, there exists 
$(m_1, \ldots, m_d) \in \mathbb{N}^d$ such that 

(9) $$\forall i \in \{1, \ldots, d\} \quad z_i = m_i - S_m \frac{a_i}{b} \quad \text{where } S_m = \sum_{j=1}^d m_j$$ 

We wish to show that there exists $(k_1, \ldots, k_d) \in \mathbb{N}^d$ such 
that $k_i - S_k a_i / b = -z_i$ for all $i$, where $S_k = \sum_{j=1}^d k_j$. 
To construct $(k_1, \ldots, k_d)$, we use the already introduced vector 
$(n_1, \ldots, n_d)$ such that $n_i - S_n a_i / b = 0$ for all $i$, where 
$S_n = \sum_{j=1}^d n_j$. Putting this together with (9), we get for any 
$C \in \mathbb{N}$: 

(10) $$-z_i + C \times 0 = -m_i + C n_i - \frac{a_i}{b} \left( C S_n - S_m \right)$$ 

For $C$ large enough, for all $i$, $C n_i - m_i > 0$. We simply take 
$k_i = C n_i - m_i$ for all $i$. This proves that starting from a point 
$z \in \Theta$ (by definition reachable from 0), $(\log \theta_t)$ can reach 
0 again. □ 
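
The construction can be checked numerically on a small case. The sketch below is only an illustration: it uses exact rational arithmetic, takes $n_i = a_i$ as one vector satisfying $n_i - S_n a_i / b = 0$ (a choice made for this example), and relies on the definition of $\Theta$ above, under which $n_i$ visits to each bin move $\log \theta$ from 0 to $n_i - \phi_i S_n$. The function names and the visit counts `m` are hypothetical.

```python
from fractions import Fraction

def displacement(counts, a, b):
    """Value of log(theta) reached from 0 after counts[i] visits to bin i."""
    total = sum(counts)
    return [Fraction(c) - Fraction(total * a_i, b) for c, a_i in zip(counts, a)]

def return_counts(m, a, b):
    """Visit counts k that bring log(theta) from the point reached by m back to 0."""
    assert sum(a) == b and all(a_i > 0 for a_i in a)
    n = list(a)                      # n_i = a_i gives n_i - S_n * a_i / b = 0
    # An integer C large enough that C * n_i - m_i > 0 for every i.
    C = max(m_i // n_i for m_i, n_i in zip(m, n)) + 1
    return [C * n_i - m_i for m_i, n_i in zip(m, n)]

a, b = (3, 1), 4                     # phi = (3/4, 1/4), as in the toy example
m = (5, 2)                           # hypothetical visit counts defining z
z = displacement(m, a, b)
k = return_counts(m, a, b)
# The displacement produced by k is exactly -z, so the process is back at 0.
assert displacement(k, a, b) == [-z_i for z_i in z]
print(z, k)
```

Here `k` plays the role of $(k_1, \ldots, k_d)$ in the proof, and the assertion is the identity $k_i - S_k a_i / b = -z_i$.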